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Disclaimer 


This talk represents my own personal view and opinion.
It does not necessarily reflect the official stance of
The Apache Software Foundation, SberTech, or any company or other entity
the author might be affiliated with at the moment of presenting or in the past.
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How to find an ideal match node 


How to find some entry and update it 
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How the storage and redo log are organized 


Over speed protection 


Apache Ignite
Intro


1 Ignite cluster - N Caches


Ignite Cache - key-value storage
- put(k,v)
- v=get(k)


Apache Ignite is a Distributed Database
for High-Performance Computing
with In-Memory Speed


In Memory Data Grid - yes
In Memory Database - yes
SQL support - yes
SQL database - not fully
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IN-MEMORY DATA GRID DISTRIBUTED DATABASE 

In-memory data grid:
* Stores data in-memory (data grid)
* Compute - code goes to data (compute grid)

Distributed database:
* Memory-centric database, since v2.1
* Scalable: each node stores only its own part of the data


Ignite can be used in combined mode (part in-memory, part - persisted) 


External data source (DB, REST, other)?
* Yes - Ignite acts as a cache
* No - Ignite is the primary storage
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Ignite Cluster 


[Diagram: compute & data nodes, client connectors, public and private networks]


* Servers - can store data
* Clients (thick) - put/get
* + Thin clients
* Data distribution is handled by the affinity function (Rendezvous Hashing)


HighLoad++


Apache Ignite Native Persistence 


is a distributed, ACID and SQL-compliant disk store


All data is on-disk, part of the data is in-memory


Each node has its own local storage


Apache Ignite = Speed/Scale
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Apache Ignite 
Cache 


Ignite Cache 


Cache main feature: K->V 
Cache ~ java.util.Map
JSR-107, JCache API 
Cache ~ Table 


Cache usually stores 1 business entry 


Entry = K+V 
1 Cache - N partitions 


K -> hash(K) ->partition -> node 
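
The chain above can be sketched in a few lines. This is only an illustration under assumptions: `KeyToNodeSketch` and its methods are hypothetical names, and the partition-to-node step uses plain round robin as a stand-in for the real affinity function.

```java
import java.util.List;

final class KeyToNodeSketch {
    static final int PARTITIONS = 1024; // default partition count mentioned in the talk

    // K -> hash(K) -> partition
    static int partition(Object key) {
        // floorMod keeps the result in [0, PARTITIONS) even for negative hash codes
        return Math.floorMod(key.hashCode(), PARTITIONS);
    }

    // partition -> node: round robin here, only a stand-in for the affinity function
    static String node(int partition, List<String> nodes) {
        return nodes.get(partition % nodes.size());
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("node-0", "node-1", "node-2");
        int part = partition("customer-42");
        System.out.println("customer-42 -> partition " + part + " -> " + node(part, nodes));
    }
}
```

The two-step mapping is the point: the key only decides the partition, and the partition decides the node.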


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Ignite Cache types 


* Partitioned and Replicated
  * Replicated - Cache A
  * Partitioned - Cache B & C


* Replicated, use case: rare writes, frequent reads, e.g. a dictionary


* Partitioned - most common
  1024 partitions by default


* Backups: 0, 1, ...
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Data location in the cluster 
Affinity 


Cache Key to node: Affinity Function 


[Diagram: Key (Field 2) -> BinaryMarshaller -> Affinity key -> key-2-part -> partition -> part-2-node (AffinityFunction) -> node]


Affinity key is the key which will be used to determine a node
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Cache Key to node: Affinity Function 


BinaryMarshaller


Affinity key-2-part part-2-node


[Diagram: Key (Field 2) -> hash(AK) -> Partition ID -> AffinityFunction]


The full key is needed for the primary index inside the partition.


Key value: used in the primary index (cache get)


Key and indexed fields are serialized in a similar manner


Field 1
Field 2
Field 3


Field value: used in building an index
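
A minimal sketch of the affinity key idea, under assumptions: `OrderKey`, `orderId` and `customerId` are hypothetical names, and the partition is taken as a simple modulo of the affinity key hash. It shows that two different full keys land in the same partition when they share the affinity key, while the full key still distinguishes them inside that partition.

```java
import java.util.Objects;

final class AffinityKeySketch {
    static final int PARTITIONS = 1024;

    // Hypothetical composite key: orderId identifies the entry, customerId is the affinity key.
    static final class OrderKey {
        final long orderId;
        final long customerId; // affinity key field

        OrderKey(long orderId, long customerId) {
            this.orderId = orderId;
            this.customerId = customerId;
        }

        // The full key (both fields) is what the primary index inside the partition compares.
        @Override public boolean equals(Object o) {
            return o instanceof OrderKey
                && ((OrderKey) o).orderId == orderId
                && ((OrderKey) o).customerId == customerId;
        }

        @Override public int hashCode() {
            return Objects.hash(orderId, customerId);
        }
    }

    // Only the affinity key takes part in choosing the partition.
    static int partition(OrderKey key) {
        return Math.floorMod(Long.hashCode(key.customerId), PARTITIONS);
    }
}
```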
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Affinity function alternatives 


* Naive: targetNode = K.hashCode() mod nodeCount
* Consistent Hashing (no longer used in Ignite)


https://en.wikipedia.org/wiki/Consistent hashin 


http://theory.stanford.edu/"tim/s16/1/11.pdf 


* Rendezvous — used by Ignite 


https://en.wikipedia.org/wiki/Rendezvous hashin 
http://www.eecs.umich.edu/techreports/cse/96/CSE-TR-316-96.pdf 
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Rendezvous or Highest Random Weight (HRW) 


Minimizes rebalancing on node set changes (node leaves, node joins)


* Naive approach: node = K.hashCode() % nodes
* Imagine a node add event


Consider the case: 1 cache, 1024 partitions


0 1 2 3 4 5 ... % 3 = 0 1 2 0 1 2 0 1 2 ...
0 1 2 3 4 5 ... % 4 = 0 1 2 3 0 1 2 3 0 ...


- Most partitions have to migrate
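
The cost of the naive scheme can be counted directly. The sketch below (the class name `NaiveModSketch` is hypothetical) counts partitions whose owner changes when the node count grows from 3 to 4.

```java
final class NaiveModSketch {
    // Counts partitions whose owner changes when the node count goes from oldNodes to newNodes.
    static int moved(int partitions, int oldNodes, int newNodes) {
        int moved = 0;
        for (int p = 0; p < partitions; p++) {
            if (p % oldNodes != p % newNodes)
                moved++;
        }
        return moved;
    }

    public static void main(String[] args) {
        // prints "766 of 1024 partitions change owner"
        System.out.println(moved(1024, 3, 4) + " of 1024 partitions change owner");
    }
}
```

Only partitions with `p % 12` in {0, 1, 2} keep their owner, so about three quarters of the data would have to move for a single added node.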
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Partition 0 


[Diagram: for Partition 0 each node gets a weight; the winner is node 2. After a node joins, whether the partition moves depends on the new winner.]
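
A minimal sketch of highest-random-weight selection, not Ignite's actual affinity function. The names are hypothetical and the mixer is the MurmurHash3 64-bit finalizer, chosen here only because it mixes well; any good hash would do.

```java
import java.util.List;

final class RendezvousSketch {
    // 64-bit mixer (MurmurHash3 finalizer); an assumption of this sketch.
    static long mix(long x) {
        x ^= x >>> 33;
        x *= 0xff51afd7ed558ccdL;
        x ^= x >>> 33;
        x *= 0xc4ceb9fe1a85ec53L;
        x ^= x >>> 33;
        return x;
    }

    // Weight of a (partition, node) pair; it does not depend on the other nodes.
    static long weight(int partition, int nodeId) {
        return mix(((long) nodeId << 32) | partition);
    }

    // The node with the highest weight wins the partition.
    static int owner(int partition, List<Integer> nodeIds) {
        int best = nodeIds.get(0);
        for (int id : nodeIds) {
            if (weight(partition, id) > weight(partition, best))
                best = id;
        }
        return best;
    }

    // Counts partitions whose owner differs between two node sets.
    static int moved(int partitions, List<Integer> before, List<Integer> after) {
        int moved = 0;
        for (int p = 0; p < partitions; p++) {
            if (owner(p, before) != owner(p, after))
                moved++;
        }
        return moved;
    }
}
```

Because each weight depends only on its own (partition, node) pair, adding a node changes the owner only for partitions that the new node wins. With nodes {0, 1, 2} plus a new node 3, that is about a quarter of the partitions in expectation, compared with roughly three quarters for the naive modulo scheme.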
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Affinity gives ideal topology. 


(The target node for each cache partition.
Actual data may still be on the way.)


* Rebalancing: moving data from one node to another


* An actual get should go to the old node until rebalancing has finished


* GridDhtPartitionFullMap (simply: node2part)


* Indexes are affected during rebalancing


[Diagram: Node 0, Node 1, Node 2, each with index pages for cache C fields]
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LOCAL node: 
Locating a key 


Let's move closer to the disk 


* One node 

* One cache 

* One partition 

* One key 

* We have some value to find

* Key is serialized by a marshaller (usually binary)


* Value (and key) is split into chunks/blocks/pages
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All HDDs are block devices 


Durable (Page) memory 


Pages Identification 


* Index - int: 0, 1, ...
* (+) Partition ID = Page ID


* Links ~ "Pointers"
* Links survive memory-HDD-memory round trips
  (they do not depend on the real address)


[Diagram: Page A holds the ID of Page B]
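
A sketch of why such links survive a round trip to disk: they are built from logical numbers, not from memory addresses. The bit layout below (16 bits of partition ID, 32 bits of page index, 8 bits of item number) is an assumption for illustration, and all names are hypothetical.

```java
final class PageIdSketch {
    // Assumed layout: [16 bits partition id][32 bits page index]
    static long pageId(int partitionId, int pageIndex) {
        return ((long) (partitionId & 0xFFFF) << 32) | (pageIndex & 0xFFFFFFFFL);
    }

    static int partitionId(long pageId) {
        return (int) ((pageId >>> 32) & 0xFFFF);
    }

    static int pageIndex(long pageId) {
        return (int) pageId;
    }

    // A link adds an item number inside the page: [8 bits item][48 bits page id]
    static long link(long pageId, int item) {
        return ((long) (item & 0xFF) << 48) | pageId;
    }

    static long pageIdOfLink(long link) {
        return link & 0xFFFFFFFFFFFFL;
    }

    static int itemOfLink(long link) {
        return (int) ((link >>> 48) & 0xFF);
    }
}
```

The same `long` value stays valid whether the page currently sits in RAM or only in the partition file, because resolving it to a real address is a separate step.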
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How to find a key in our pages 


* B+ Tree
  * Read optimized
  * ~ Linked list (of pages) with top levels
  * Order of iteration - preferable for range lookups

* PK/Primary Index
  * for each partition
  * Key hash based
  * Key value compared (collision resistant)

* Secondary index: value or value start in the index page.
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B+ Tree 


[Diagram: an inner page stores keys and child page IDs; child pages cover keys < K1, ..., keys < K3, ..., keys >= K4]
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B+ Tree 


https://www.cs.usfca.edu/~galles/visualization/BPlusTree.html
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LOCAL node: 
Store Values 


Pages are allocated within a region randomly


[Diagram: durable memory region split into segments]
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Data Page structure 


[Diagram: data page with entries and free space]


K&V locator (in local node): FullPageId + item


Finding a suitable page for inserting data


Free lists structure: linked lists of pages, bucketed by the free space available in the page


8-15 free bytes
16-23 free bytes
...
2032-2039 free bytes
2040-2048 free bytes
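
The bucket table can be sketched as an array of page lists indexed by free space divided by 8. This is a simplified illustration with hypothetical names, assuming the 2048-byte upper bound shown in the last bucket; it is not the Ignite free list implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class FreeListSketch {
    static final int PAGE_SIZE = 2048; // taken from the last bucket in the table
    static final int STEP = 8;         // each bucket covers 8 bytes of free space

    // buckets[i] holds ids of pages with about i*8 .. i*8+7 free bytes
    private final Deque<Long>[] buckets;

    @SuppressWarnings("unchecked")
    FreeListSketch() {
        buckets = new Deque[PAGE_SIZE / STEP];
        for (int i = 0; i < buckets.length; i++)
            buckets[i] = new ArrayDeque<>();
    }

    static int bucket(int freeBytes) {
        return Math.min(freeBytes / STEP, PAGE_SIZE / STEP - 1);
    }

    void put(long pageId, int freeBytes) {
        buckets[bucket(freeBytes)].push(pageId);
    }

    // Takes a page guaranteed to fit `needed` bytes, or returns -1 if a new page must be allocated.
    long take(int needed) {
        // Round up to the next bucket so every scanned page has at least `needed` free bytes.
        int start = (needed + STEP - 1) / STEP;
        for (int i = start; i < buckets.length; i++) {
            if (!buckets[i].isEmpty())
                return buckets[i].pop();
        }
        return -1;
    }
}
```

Lookup never inspects individual pages: it jumps to the first bucket that is large enough and takes the head of the first non-empty list.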


Long objects


[Diagram: a long object is split across several pages]
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RAM structure overview 


Durable Memory
* Regions - as configured - 1 : N


* A region has segments (their number depends on the CPU count by default)


* Segment - a set of RAM pages


* Pages have types and different formats


* Pages are linked to each other (across segments)


[Diagram: memory segments holding pages, free lists and meta pages]
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Loaded Pages Map Segment Data 
[Diagram: Segment 1, Segment 2; each segment has a loaded pages map (buckets) and page data]


IGNITE_LONG_LONG_HASH_MAP_LOAD_FACTOR
(Capacity = loadFactor * maxPages in segment)

DataStorageConfiguration
setConcurrencyLevel
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LOCAL node: 
A Page Modify and Write 


Page R/W operations in RAM


Page ID -> real address in memory


- 1 atomic operation: resolution of ID to address & page lock


Case of setting a field in the page:
  o AbstractDataPageIO#setFirstEntryOffset
  o PageUtils#putShort
  o GridUnsafe#putShort(long, short)
  o sun.misc.Unsafe#putShort(long, short)
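
The call chain ends in a raw 2-byte write at an off-heap address. The sketch below shows the same idea with a direct `ByteBuffer` instead of `sun.misc.Unsafe`, which is a deliberate substitution to keep the example safe to run. The field position and page size are hypothetical values.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class PageFieldSketch {
    static final int PAGE_SIZE = 4096;            // assumed page size
    static final int FIRST_ENTRY_OFFSET_POS = 40; // hypothetical position of the field in the page

    // Off-heap memory for one page.
    static ByteBuffer allocatePage() {
        return ByteBuffer.allocateDirect(PAGE_SIZE).order(ByteOrder.nativeOrder());
    }

    // Writes a 2-byte field at a fixed position inside the page.
    static void setFirstEntryOffset(ByteBuffer page, int value) {
        page.putShort(FIRST_ENTRY_OFFSET_POS, (short) value);
    }

    // Reads the field back as an unsigned 16-bit value.
    static int getFirstEntryOffset(ByteBuffer page) {
        return page.getShort(FIRST_ENTRY_OFFSET_POS) & 0xFFFF;
    }
}
```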
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Read-Write (Memory & disc) 


// Ignite classes:
* FileIO


RandomAccessFileIO#write(ByteBuffer, long)


FileChannelImpl#write(ByteBuffer, long)
IOUtil#write(FileDescriptor, ByteBuffer, long, NativeDispatcher)
IOUtil#writeFromNativeBuffer
NativeDispatcher#pwrite(...,
    ((DirectBuffer) var1).address() + (long) var5, ...
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IO Util implementation 


if (var1 instanceof DirectBuffer) {
    return writeFromNativeBuffer(var0, var1, var2, var4);
} else {
    ...
    ByteBuffer var8 = Util.getTemporaryDirectBuffer(var7);
    ...
}
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Over speed protection 


WAL 


Write is not immediate after update


Data comes to
(1) In-memory pages; the pages become dirty
(2) Write Ahead Log
(3) Update/TX completed
(4) Checkpointing = updating page store files = background process


[Diagram: 2. persist update to the WAL file; 4. checkpointing to the page store file]
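
The four steps can be modelled with plain collections. This is a toy model with hypothetical names, where a list stands in for the WAL file and a map for the page store files; it only illustrates the order of events.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class WritePathSketch {
    final Map<Long, String> memoryPages = new HashMap<>(); // (1) pages in RAM
    final Set<Long> dirtyPages = new HashSet<>();          // pages changed since the last checkpoint
    final List<String> wal = new ArrayList<>();            // (2) stand-in for the WAL file
    final Map<Long, String> pageStore = new HashMap<>();   // (4) stand-in for page store files

    // Steps (1)-(3): update in memory, log the change, then report completion.
    void update(long pageId, String content) {
        memoryPages.put(pageId, content);
        dirtyPages.add(pageId);
        wal.add("page=" + pageId + " content=" + content);
        // (3) the update is complete here; the page store has not been touched
    }

    // Step (4): a background checkpoint copies dirty pages to the page store.
    void checkpoint() {
        for (long pageId : dirtyPages)
            pageStore.put(pageId, memoryPages.get(pageId));
        dirtyPages.clear();
    }
}
```

Between `update` and `checkpoint` the only durable trace of the change is the WAL record, which is why the log has to be written before the update is reported as done.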
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WAL 


- WAL provides the A & D properties of ACID


https://en.wikipedia.org/wiki/Write-ahead loggin


* Both logical
  * Set user.lastSeen=...
* And physical
  * Change page PageID=..., at offset 4 to NNNNNNNN
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Operation example 


1. Set customer.lastSeen=xxx

K -> AK -> Partition -> Node
Locking record & updating

FileChannel.force() (*depends on mode)
fsync()/fdatasync() in POSIX


2. Commit


...WAL offloaded to HDD

[Diagram: physical record appended to the WAL]
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Why 2 types of records 


After checkpoint
- 1 CP history for data (logical)


In the middle of checkpoint
- 2 CP for structure (physical)
- 1 CP for data (logical)


[Diagram: persistent store files part-1.bin, part-2.bin]
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More than 1 record for 1 put 


1 logical, 1 physical 

Field changes its length + free lists updates 

Field update for 2+ pages for long objects 

Indexed field updated, need to add and remove index B+Tree nodes 

Updates in an index may require split/merge operations for pages (2-3 pages affected)


Ignite tracks modifications, so the tracking page will be updated


BUT: the latest Apache Ignite can share the byte payload between records.
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Sync strategies 


* FSYNC - survives any case: OS crash and power loss
* LOG_ONLY - gives the data to the OS; survives a process crash
* BACKGROUND - flush by timer, some records may be lost


* Actual WAL is not one file
* Set of files = segments
  - Active, ready to be filled - work
  - Finalized - archive
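
The difference between the three strategies comes down to what happens between accepting a record and returning. The sketch below is an illustration with hypothetical names, not the Ignite WAL manager; it is single-threaded and the in-memory buffer is a fixed 64 KB with no overflow handling.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class WalSyncSketch implements AutoCloseable {
    enum Mode { FSYNC, LOG_ONLY, BACKGROUND }

    private final FileChannel channel;
    private final Mode mode;
    private final ByteBuffer pending = ByteBuffer.allocate(64 * 1024); // BACKGROUND mode only

    WalSyncSketch(Path file, Mode mode) throws IOException {
        this.channel = FileChannel.open(file,
            StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
        this.mode = mode;
    }

    void append(String record) throws IOException {
        byte[] bytes = (record + "\n").getBytes(StandardCharsets.UTF_8);
        switch (mode) {
            case FSYNC:
                writeFully(ByteBuffer.wrap(bytes));
                channel.force(false); // ask the OS to push the data to the device
                break;
            case LOG_ONLY:
                writeFully(ByteBuffer.wrap(bytes)); // handed to the OS, no force
                break;
            case BACKGROUND:
                pending.put(bytes); // stays in process memory until flush() runs
                break;
        }
    }

    // In BACKGROUND mode a timer thread would call this periodically.
    void flush() throws IOException {
        pending.flip();
        writeFully(pending);
        pending.clear();
    }

    private void writeFully(ByteBuffer buf) throws IOException {
        while (buf.hasRemaining())
            channel.write(buf);
    }

    @Override public void close() throws IOException {
        channel.close();
    }
}
```

In BACKGROUND mode a record that has been appended but not yet flushed exists only in the process heap, which is exactly the window in which records may be lost.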
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Actual WAL structure 


WAL Records -> log() -> fixed size segment files


Worker threads: T1 logs, T2 waits


Work directory
  00.wal 01.wal 02.wal 08.wal
  File name: N % WAL segments (default 10)
  Active file rotation


Archiver thread
  Moves files from work to archive


Archive directory
  45.wal 46.wal 47.wal 48.wal
  File name: N.wal, N - absolute segment index
  WAL History Size, in checkpoints
  Deleted at checkpoint end
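
The two naming rules can be written down directly. The two-digit file name format follows the names shown on the slide and is an assumption of this sketch, as are the class and method names.

```java
final class WalSegmentSketch {
    static final int WORK_SEGMENTS = 10; // "default 10" on the slide

    // The work directory is a ring: the absolute index wraps around the segment count.
    static String workFileName(long absoluteIndex) {
        return String.format("%02d.wal", absoluteIndex % WORK_SEGMENTS);
    }

    // The archive keeps the absolute index in the file name.
    static String archiveFileName(long absoluteIndex) {
        return String.format("%02d.wal", absoluteIndex);
    }
}
```

So absolute segment 47 is written as `07.wal` in the work directory and later archived as `47.wal`, which frees the work file for segment 57.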
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Page store 
(main storage for caches Data) 


File Structure


* File per partition
* Folder per cache
* WAL is shared by all caches
* Indexes are shared by all partitions


[Diagram: Cache_A files, Cache_B files, WAL, WAL archive]
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Page store: File Structure 


[Diagram: pages stored in files part-0.bin, part-1.bin, ..., part-N.bin, index.bin]
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Locate page placement: File Offset=Index*pageSize 


[Diagram: page stores of caches "C1" and "C2" (folder cache-C1 and so on); each has partitions P0, P1, ..., PN and an index PI, stored in files part-0.bin, part-1.bin, ..., part-N.bin, index.bin; the page with idx=0 is at the start of its file]
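
A sketch of positional page I/O using the offset formula from the slide title. The page size of 4096 bytes and all names are assumptions, and the sketch ignores any file header a real page store may have.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class PageFileSketch {
    static final int PAGE_SIZE = 4096; // assumed page size

    // File offset = index * pageSize
    static long offset(int pageIndex) {
        return (long) pageIndex * PAGE_SIZE;
    }

    // Writes one full page at its computed offset in the partition file.
    static void writePage(Path partFile, int pageIndex, ByteBuffer page) throws IOException {
        try (FileChannel ch = FileChannel.open(partFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            long pos = offset(pageIndex);
            while (page.hasRemaining())
                pos += ch.write(page, pos);
        }
    }

    // Reads one page back from its computed offset.
    static ByteBuffer readPage(Path partFile, int pageIndex) throws IOException {
        ByteBuffer page = ByteBuffer.allocateDirect(PAGE_SIZE);
        try (FileChannel ch = FileChannel.open(partFile, StandardOpenOption.READ)) {
            long pos = offset(pageIndex);
            while (page.hasRemaining()) {
                int n = ch.read(page, pos);
                if (n < 0)
                    break; // end of file
                pos += n;
            }
        }
        page.flip();
        return page;
    }
}
```

Using a direct buffer here matches the earlier observation about the JDK write path: a heap buffer would first be copied into a temporary direct buffer.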
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Checkpointing 


RAM to page store: checkpointing


Periodic action: by timer or by dirty pages percent


Marking pages:
* Fast STW collection of the current dirty page sets
* Checkpoint pages: Set<Id>
* This is our scope of data to be written to disk
* Saved page: dirty = 0


[Diagram: dirty pages of cache "C"]


checkpointLock.writeLock()
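
A sketch of the marking phase, assuming the usual read/write lock pattern: updaters hold the read lock while marking pages dirty, and the checkpointer holds the write lock only long enough to swap the set. Class and method names are hypothetical.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

final class CheckpointMarkSketch {
    private final ReentrantReadWriteLock checkpointLock = new ReentrantReadWriteLock();
    private volatile Set<Long> dirtyPages = ConcurrentHashMap.newKeySet();

    // Updaters share the read lock, so they run concurrently with each other.
    void markDirty(long pageId) {
        checkpointLock.readLock().lock();
        try {
            dirtyPages.add(pageId);
        } finally {
            checkpointLock.readLock().unlock();
        }
    }

    // The checkpointer takes the write lock just to swap the set (the short stop-the-world pause).
    Set<Long> beginCheckpoint() {
        checkpointLock.writeLock().lock();
        try {
            Set<Long> scope = dirtyPages;
            dirtyPages = ConcurrentHashMap.newKeySet();
            return scope; // pages to be written to disk by this checkpoint
        } finally {
            checkpointLock.writeLock().unlock();
        }
    }

    int dirtyCount() {
        return dirtyPages.size();
    }
}
```

Pages dirtied after the swap go into the fresh set and belong to the next checkpoint, so the scope of the current one is fixed the moment the write lock is released.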
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Process of writing 


Page conflict during checkpoint => copy on write into the CP pool.


CP buffer/pool: holds copies of the pages to be checkpointed


Overflow is protected by exponential backoff, always enabled


AGENDA
What is Apache Ignite and caches
How to find an ideal match node
How to find some entry and update it
How the storage and redo log are organized
=> Over speed protection


[Chart labels: pages left, total pages, total pages to write, pages written and synced, actual progress, average CP write speed, estimated CP complete time, mark dirty progress, overspeed, throttling zone]
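
One way to read the chart is as a comparison of two ratios: how far marking pages dirty has progressed versus how far the checkpoint has progressed in writing. The sketch below is an interpretation under that assumption, not the Ignite throttling policy; the names, the ratio test and the backoff parameters are all hypothetical.

```java
final class ThrottleSketch {
    // True when the share of pages marked dirty runs ahead of the share of
    // checkpoint pages already written and synced.
    static boolean overspeed(long markedDirty, long totalPages, long cpWritten, long cpTotal) {
        double dirtyRatio = (double) markedDirty / totalPages;
        double writtenRatio = cpTotal == 0 ? 1.0 : (double) cpWritten / cpTotal;
        return dirtyRatio > writtenRatio;
    }

    // Exponential backoff: each consecutive throttled update parks longer.
    static long parkNanos(int consecutiveThrottles, long startNanos, double factor) {
        return (long) (startNanos * Math.pow(factor, consecutiveThrottles));
    }
}
```

An updater thread would call `overspeed` before marking a page dirty and, if it returns true, park for `parkNanos(...)` before retrying, so that writers slow down gradually instead of stalling when the checkpoint falls behind.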
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Files layout 


Ignite work directory (IgniteConfiguration, setWorkDirectory)
  {consistent id}             Binary metadata local file store
  db
    {consistent id}           PersistentStoreConfiguration, persistenceStorePath
      cache-{cache-name}
      cache-ignite-sys-cache
      cp                      Checkpoint markers
    {consistent id}           WAL work; PersistentStoreConfiguration, walStorePath
    archive                   PersistentStoreConfiguration, walArchivePath
      {consistent id}
  marshaller


* Consistent ID - random or user-specified node ID


Not covered


* Marshaller cache


* Binary Metadata
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Summary 


Dos and Don'ts of today


- Don't write odd stuff to your DB
- Length in bytes still matters
- Separate WAL & Page Store
- Don't set several nodes to share one HDD
- Use SSD where possible


Main don't of today:
Don't write your own database


(if you still want hardcore, join the Apache Ignite community:
dev@ignite.apache.org)


https://ignite.apache.org/community/contribute.html


Links 


http://ignite.apache.org
https://apacheignite.readme.io/docs/durable-memor
https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Durable+Memory+-+under+the+hood
https://apacheignite.readme.io/docs/distributed-persistent-store
https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+under+the+hood
https://www.cs.usfca.edu/~galles/visualization/Algorithms.html
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Should you have any question 


dpavlov@apache.or
or just google "Dmitriy Pavlov Apache"


Apache Ignite
How to use: user@ignite.apache.org
Related to contribution: dev@ignite.apache.org


Ignite Summit
May 25, online and free
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